Need for speed in accurate whole-genome data analysis: GENALICE MAP challenges BWA/GATK more than PEMapper/PECaller and Isaac.

Authors

  • Michel Plüss
  • Anna M Kopps
  • Irene Keller
  • Janine Meienberg
  • Sylvan M Caspar
  • Nicolo Dubacher
  • Rémy Bruggmann
  • Manfred Vogel
  • Gabor Matyas
Abstract

In the current high-throughput genomics era, efficient and accurate analysis of large-scale whole-genome sequencing (WGS) data constitutes a computational bottleneck. Johnston et al. (1) introduce the PEMapper/PECaller software package for short-read WGS alignment and variant calling, promising faster analyses with reduced output file sizes and “nearly identical (or better)” variant calling accuracy compared with the de facto standard Burrows–Wheeler aligner/Genome Analysis Toolkit (BWA/GATK) best-practices pipeline (2). However, we cannot confirm this promised BWA/GATK-like accuracy of PEMapper/PECaller, and there are other pipelines offering ultrafast WGS data analyses with small disk footprints, as we show in this correspondence. To assess sensitivity/recall, precision, computation time, and disk footprint of four corresponding pipelines, we performed alignment and variant calling for the reference short-read WGS data of NA12878 and the Ashkenazim trio (3, 4). The four pipelines included the downloadable PEMapper/PECaller (1) and BWA/GATK (2) as well as the commercially available Isaac (5) and GENALICE MAP (genalice.com) software packages (versions and settings specified in Fig. 1). To largely reduce systematic errors and alignment artifacts, we limited our benchmarking of whole-genome variant calling to the coding part of the high-confidence BED file of GIAB 3.3 (https://github.com/genome-in-a-bottle), excluding exons with mappability <1, differences between GRCh37 and GRCh38, and/or common copy number variations (CNVs) (6). In our benchmarking, PEMapper/PECaller was, although powerful, neither the fastest pipeline (Fig. 1) nor as sensitive in variant calling as BWA/GATK (Fig. 2A). Indeed, PEMapper/PECaller resulted in the highest number of false-negative calls (Fig. 2A), making it less suitable for clinical sequencing. As expected, BWA/GATK showed the highest sensitivity but fell behind the other three pipelines regarding run time and disk footprint.
GENALICE MAP showed sensitivity comparable to BWA/GATK (Fig. 2A) but with a 112× faster total run time and a 45× lower disk footprint (Fig. 1). In precision, only minor differences were observed among pipelines, except for the PEMapper/PECaller population calling and the GENALICE MAP single-sample calling pipelines, which performed with the lowest and with distinctly lower precision, respectively, using downloaded FASTQ files (Fig. 2B). The difference between downloaded and our in-house data was pronounced in the sensitivity of the PEMapper/PECaller single-sample pipeline as well (Fig. 2A), suggesting considerable influence of input sequencing reads on PEMapper/PECaller. However, although the here-applied reference datasets may have been used for pipeline optimization, there are no alternative/unbiased whole-genome truth sets available for benchmarking. Moreover, PEMapper/PECaller does not output BAM files, which are particularly useful in clinical sequencing for evaluating called variants and in CNV detection. Regarding run time, BWA/GATK might soon catch up with PEMapper/PECaller if the upcoming GATK version 4.0 is indeed 5× faster as announced or might even be faster if accelerated by the DRAGEN platform (edicogenome.com) or compressive methods such as CORA (7). Impressively, GENALICE MAP has already achieved ultrarapid speed and superior low disk footprint with BWA/GATK-like sensitivity, thus enabling efficient (re)analyses of ever-increasing amounts of WGS data.
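
The sensitivity/recall and precision figures discussed above follow from counts of true-positive, false-positive, and false-negative calls against a truth set. The sketch below illustrates that arithmetic only; it is not the comparison procedure used in the letter. The function name `benchmark`, the `Metrics` type, and the toy variants are hypothetical, and exact matching on (chromosome, position, ref, alt) is a simplifying assumption, since real comparisons typically normalize variant representation and restrict calls to the high-confidence regions first.

```python
from typing import NamedTuple, Set, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Metrics(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    sensitivity: float  # also called recall: TP / (TP + FN)
    precision: float    # TP / (TP + FP)


def benchmark(calls: Set[Variant], truth: Set[Variant]) -> Metrics:
    """Compare a pipeline's call set with a truth set by exact matching."""
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but missed by the pipeline
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return Metrics(tp, fp, fn, sensitivity, precision)


# Toy data (hypothetical, not taken from the letter).
truth = {("1", 100, "A", "G"), ("1", 200, "C", "T"),
         ("2", 50, "G", "A"), ("2", 75, "T", "C")}
calls = {("1", 100, "A", "G"), ("1", 200, "C", "T"),
         ("2", 50, "G", "A"), ("3", 10, "A", "T")}

m = benchmark(calls, truth)
print(m)
```

In this toy case one truth variant is missed and one spurious variant is called, so both sensitivity and precision come out at 0.75. A pipeline with many false negatives, as reported for PEMapper/PECaller, lowers the sensitivity term while leaving precision largely unaffected.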

Similar resources

PEMapper / PECaller: A simplified approach to whole-genome sequencing

The analysis of human whole-genome sequencing data presents significant computational challenges. The sheer size of datasets places an enormous burden on computational, disk array, and network resources. Here we present an integrated computational package, PEMapper/PECaller, that was designed specifically to minimize the burden on networks and disk arrays, create output files tha...

Reply to Plüss et al.: The strength of PEMapper/PECaller lies in unbiased calling using large sample sizes.

In a recent Letter in PNAS (1), Plüss et al. compare the speed and accuracy of the Burrows–Wheeler aligner (BWA) (2)/ Genome Analysis Toolkit (GATK) (3) best-practices pipeline (4), against our PEMapper/PECaller pipeline (5), as well as against a commercially available, but un–peer-reviewed method called GENALICEMAP (genalice.com). This test was conducted in an interesting fashion, limiting the...

PEMapper and PECaller provide a simplified approach to whole-genome sequencing.

The analysis of human whole-genome sequencing data presents significant computational challenges. The sheer size of datasets places an enormous burden on computational, disk array, and network resources. Here, we present an integrated computational package, PEMapper/PECaller, that was designed specifically to minimize the burden on networks and disk arrays, create output files that are minimal ...

Isaac: ultra-fast whole-genome secondary analysis on Illumina sequencing platforms

SUMMARY An ultrafast DNA sequence aligner (Isaac Genome Alignment Software) that takes advantage of high-memory hardware (>48 GB) and variant caller (Isaac Variant Caller) have been developed. We demonstrate that our combined pipeline (Isaac) is four to five times faster than BWA + GATK on equivalent hardware, with comparable accuracy as measured by trio conflict rates and sensitivity. We furth...

Variant Callers for Next-Generation Sequencing Data: A Comparison Study

Next generation sequencing (NGS) has been leading the genetic study of human disease into an era of unprecedented productivity. Many bioinformatics pipelines have been developed to call variants from NGS data. The performance of these pipelines depends crucially on the variant caller used and on the calling strategies implemented. We studied the performance of four prevailing callers, SAMtools,...

Journal title:
  • Proceedings of the National Academy of Sciences of the United States of America

Volume 114, Issue 40

Pages -

Publication date: 2017